DO NOT MERGE - CI sandbox for stateless scheduler b #25172

Open
fzyzcjy wants to merge 57 commits into sgl-project:main from fzyzcjy:feat/stateless_scheduler_b

@fzyzcjy
Collaborator

@fzyzcjy fzyzcjy commented May 13, 2026

Motivation

Modifications

Accuracy Tests

Speed Tests and Profiling

Checklist

Review and Merge Process

  1. Ping Merge Oncalls to start the process. See the PR Merge Process.
  2. Get approvals from CODEOWNERS and other reviewers.
  3. Trigger CI tests with comments or contact authorized users to do so.
    • Common commands include /tag-and-rerun-ci, /tag-run-ci-label, /rerun-failed-ci
  4. After green CI and required approvals, ask Merge Oncalls or people with Write permission to merge the PR.

CI States

Latest PR Test (Base): Not run yet
Latest PR Test (Extra): ⚠️ Not run on latest push -- push again to dispatch.

@fzyzcjy
Collaborator Author

fzyzcjy commented May 13, 2026

/tag-and-rerun-ci

Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request refactors the chunked prefill mechanism by replacing the global chunked_req pointer with per-request state flags (has_pending_chunk and pending_middle_outputs). This change enhances support for pipeline parallelism and ensures more robust state management across iterations. The review feedback identifies potential null pointer crashes, logic inconsistencies in request abort handling, and performance optimizations for hot-path queue scans, all of which include actionable code suggestions.

Comment on lines +3583 to +3586
for mb_list in (self.mbs, self.last_mbs, self.running_mbs):
    for mb in mb_list:
        if mb is not None and not mb.is_empty():
            batch_reqs.extend(mb.reqs)

high

The iteration over self.mbs, self.last_mbs, and self.running_mbs will crash if any of these attributes are None. While self.mbs is typically a list, last_mbs and running_mbs are often None in certain scheduler states or configurations.

Suggested change
- for mb_list in (self.mbs, self.last_mbs, self.running_mbs):
-     for mb in mb_list:
-         if mb is not None and not mb.is_empty():
-             batch_reqs.extend(mb.reqs)
+ if self.pp_size > 1 and hasattr(self, "mbs"):
+     for mb_list in (self.mbs, self.last_mbs, self.running_mbs):
+         if mb_list is not None:
+             for mb in mb_list:
+                 if mb is not None and not mb.is_empty():
+                     batch_reqs.extend(mb.reqs)
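
The guard suggested above can be tried in isolation. This is a minimal sketch, assuming only that microbatch objects expose `is_empty()` and `reqs` as in the snippet; `FakeMicroBatch` and `collect_batch_reqs` are hypothetical names for illustration, not sglang APIs.

```python
from typing import Iterable, List, Optional


class FakeMicroBatch:
    """Hypothetical stand-in for a scheduler microbatch (not an sglang class)."""

    def __init__(self, reqs: List[str]):
        self.reqs = reqs

    def is_empty(self) -> bool:
        return not self.reqs


def collect_batch_reqs(
    *mb_lists: Optional[Iterable[Optional[FakeMicroBatch]]],
) -> List[str]:
    """Gather reqs from several microbatch lists, tolerating None at both levels."""
    batch_reqs: List[str] = []
    for mb_list in mb_lists:
        if mb_list is None:  # e.g. a list attribute that was never initialized
            continue
        for mb in mb_list:
            if mb is not None and not mb.is_empty():
                batch_reqs.extend(mb.reqs)
    return batch_reqs
```

Whether `last_mbs` and `running_mbs` can really be `None` in the scheduler is the reviewer's claim; the sketch only shows that the extra check is cheap.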

Comment on lines +3592 to +3594
if (recv_req.abort_all or req.rid.startswith(recv_req.rid)) and (
    req.rid not in batch_rids
):

medium

Aborted requests that are currently in a batch (e.g., chunked-resume requests) should still be removed from the waiting_queue list to maintain a consistent scheduler state. The current logic skips them entirely. To avoid double-releasing resources, you should remove them from the list but skip the release_kv_cache call inside the processing loop (by checking if req.rid in batch_rids: continue).

            if (recv_req.abort_all or req.rid.startswith(recv_req.rid)):
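
One way to read this suggestion: always drop matching requests from the waiting queue, but release resources only for those that no batch also holds. The sketch below illustrates that idea with hypothetical names (`Req`, `abort_waiting`, `release`); it is not the scheduler's actual abort path.

```python
from dataclasses import dataclass
from typing import Callable, List, Set, Tuple


@dataclass
class Req:
    """Hypothetical request record; only the request id matters here."""

    rid: str


def abort_waiting(
    waiting_queue: List[Req],
    batch_rids: Set[str],
    rid_prefix: str,
    abort_all: bool,
    release: Callable[[Req], None],
) -> Tuple[List[Req], List[Req]]:
    """Return (remaining_queue, removed); release only reqs no batch owns."""
    remaining: List[Req] = []
    removed: List[Req] = []
    for req in waiting_queue:
        if abort_all or req.rid.startswith(rid_prefix):
            removed.append(req)
            if req.rid in batch_rids:
                # Assumed: the batch result path finishes and releases this req.
                continue
            release(req)
        else:
            remaining.append(req)
    return remaining, removed
```

The design point is that queue membership and resource release are decided separately, so a request held in both places is removed once and released once.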

Comment thread python/sglang/srt/managers/scheduler.py Outdated
# priority + has_pending_chunk make it sit at the head, but its
# presence relaxes the "is queue empty / pool full" early exits below
# (we must keep scheduling it to make progress, or memory leaks).
has_chunked_resume = any(r.has_pending_chunk for r in self.waiting_queue)

medium

Performing an $O(N)$ scan of the waiting_queue using any() in the scheduler's hot path is inefficient. Since the scheduling policy ensures that has_pending_chunk requests are sorted to the front of the queue, you can optimize this by checking only the first element.

Suggested change
- has_chunked_resume = any(r.has_pending_chunk for r in self.waiting_queue)
+ has_chunked_resume = self.waiting_queue[0].has_pending_chunk if self.waiting_queue else False

Comment thread python/sglang/srt/managers/scheduler.py Outdated
Comment on lines +2664 to +2666
chunked_resume = next(
    (r for r in self.waiting_queue if r.has_pending_chunk), None
)

medium

This $O(N)$ scan can be optimized. Since has_pending_chunk requests are sorted to the front of the queue, the first element is the only one that needs to be checked.

            chunked_resume = self.waiting_queue[0] if self.waiting_queue and self.waiting_queue[0].has_pending_chunk else None
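
The head-only check is equivalent to the full scan only while the stated ordering holds. The sketch below puts both lookups side by side; `Req` is a hypothetical stand-in, and the sort-to-front behavior is taken from the review comment rather than verified against the scheduling policy.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Req:
    """Hypothetical request with the per-request flag discussed in the review."""

    rid: str
    has_pending_chunk: bool = False


def head_chunked_resume(waiting_queue: List[Req]) -> Optional[Req]:
    """O(1) lookup; valid only if pending-chunk reqs are always at the front."""
    if waiting_queue and waiting_queue[0].has_pending_chunk:
        return waiting_queue[0]
    return None


def scan_chunked_resume(waiting_queue: List[Req]) -> Optional[Req]:
    """O(N) lookup that does not rely on queue ordering."""
    return next((r for r in waiting_queue if r.has_pending_chunk), None)
```

If the queue can be observed before it is sorted, the two functions disagree, so the faster form is only safe where that ordering is guaranteed.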

Comment thread python/sglang/srt/managers/scheduler.py Outdated
@fzyzcjy
Collaborator Author

fzyzcjy commented May 13, 2026

/rerun-failed-ci

@fzyzcjy
Collaborator Author

fzyzcjy commented May 13, 2026

/rerun-test stage-b-test-2-gpu-large

@github-actions
Contributor

github-actions Bot commented May 13, 2026

stage-b-test-2-gpu-large: No test file found matching stage-b-test-2-gpu-large under test/registered/ or python/sglang/multimodal_gen/test/.

@fzyzcjy
Collaborator Author

fzyzcjy commented May 13, 2026

/rerun-failed-ci

5 similar comments
@fzyzcjy
Collaborator Author

fzyzcjy commented May 13, 2026

/rerun-failed-ci

@fzyzcjy
Collaborator Author

fzyzcjy commented May 13, 2026

/rerun-failed-ci

@fzyzcjy
Collaborator Author

fzyzcjy commented May 13, 2026

/rerun-failed-ci

@fzyzcjy
Collaborator Author

fzyzcjy commented May 13, 2026

/rerun-failed-ci

@fzyzcjy
Collaborator Author

fzyzcjy commented May 14, 2026

/rerun-failed-ci

@fzyzcjy
Collaborator Author

fzyzcjy commented May 14, 2026

stage-c-test-dsv4-8-gpu-h200 failures — infra flakes, not v2-related

The dsv4-8-gpu-h200 job has been failing on this PR's reruns. After digging into the logs, all three observed failure modes occur in setUpClass / server startup, before any v2 code path is exercised.

Attempt 3 — leftover process holds NCCL port

log

ValueError: nccl_port at 36919 is not available in 30 seconds.
nccl_port is used by a process already.
process.cmdline=['python3', '-m', 'sglang.launch_server',
                 '--disaggregation-mode', 'prefill', ...]
process.status()='running' pid=3168509

The PID is from a prior job on the same self-hosted runner that wasn't cleaned up between attempts.
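
A pre-flight check for this failure mode could look like the sketch below. It uses only the Python standard library; `port_is_free` is a hypothetical helper and is not how sglang itself validates `nccl_port`.

```python
import socket


def port_is_free(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if a TCP socket can bind to (host, port) right now."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
        except OSError:
            # Typically "address already in use": another process holds the port.
            return False
        return True
```

A runner cleanup step might call this for the configured port before launching the server and fail fast with a clearer message when a leftover process still holds it.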

Attempt 4 — two distinct distributed-init failures

log

test_disaggregation_dsv4.py — NVSHMEM UID bootstrap times out:

bootstrap.cpp:242: non-zero status: 7 bootstrap_loader_init returned error for mode UID
init.cu:1188: non-zero status: 7 nvshmem_bootstrap failed
nvshmemx_api.h:63: non-zero status: 7: Connection timed out, exiting...
[2026-05-13 22:48:20] Received sigquit from a child process. It usually means the child failed.
EOFError

test_deepseek_v4_flash_fp8_h200.py — server child SIGKILL during setUpClass:

File "test/registered/dsv4/test_deepseek_v4_flash_fp8_h200.py", line 39, in setUpClass
    cls.process = popen_launch_server(...)
Exception: Server process exited with code -9. Check server logs for errors.

Why this isn't v2

  • All failures occur in setUpClass (port acquisition / NVSHMEM bootstrap / server warmup). The test bodies never run, so no scheduler / chunked-prefill code is invoked.
  • This PR's changes are confined to chunked-prefill state-machine migration (per-Req fields, PP cross-microbatch finalize guard) and v1-test cleanup. None of it touches NCCL port allocation, NVSHMEM init, or server bootstrap.
  • The same NVSHMEM-UID-timeout / port-busy patterns recur across stage-c h100 / h200 / h20 lanes on unrelated PRs.

Cascade impact

dsv4-8-gpu-h200 is a fast-fail root, so each failure cascades to ~10 downstream jobs that skip with:

##[error]Fast-fail: skipping — root cause job(s): stage-c-test-dsv4-8-gpu-h200

Those downstream skips (deepep-8-gpu-h200, 4-gpu-h100, 4-gpu-b200 (0), etc.) are not independent failures.
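
To separate cascaded skips from independent failures, the skip message can be parsed and jobs grouped by the root cause it names. This is a sketch under the assumption that every cascaded job carries a message in the format quoted above; `group_by_root_cause` is a hypothetical helper, and the text after `job(s):` is treated as a single opaque key.

```python
import re
from typing import Dict, List

# Matches the fast-fail line; \W+ absorbs the punctuation between the two phrases.
FAST_FAIL_RE = re.compile(r"Fast-fail: skipping\W+root cause job\(s\): (.+)$")


def group_by_root_cause(job_messages: Dict[str, str]) -> Dict[str, List[str]]:
    """Map each named root-cause job to the downstream jobs skipped because of it."""
    grouped: Dict[str, List[str]] = {}
    for job, message in job_messages.items():
        match = FAST_FAIL_RE.search(message)
        if match:
            grouped.setdefault(match.group(1).strip(), []).append(job)
    return grouped
```

Jobs whose message does not match are left out of the result, so they remain candidates for real failures.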

@fzyzcjy
Collaborator Author

fzyzcjy commented May 14, 2026

stage-c-test-4-gpu-b200 (3) failure — pre-existing NVFP4 weight-update bug, not v2-related

Job: stage-c-test-4-gpu-b200 (3), attempt 5 — log

Failing test: test/registered/rl/test_update_weights_from_disk_blackwell.py::test_parameterized_update_weights_from_disk

Error (server-side, during update_weights_from_disk RPC):

File "python/sglang/srt/layers/quantization/modelopt_quant.py", line 1361, in process_weights_after_loading
    input_scale_2 = layer.input_scale.max().to(torch.float32)
File "torch/nn/modules/module.py", line 1968, in __getattr__
    raise AttributeError(
AttributeError: 'QKVParallelLinear' object has no attribute 'input_scale'. Did you mean: 'input_size'?

Stack trace (server-side):

File "python/sglang/srt/managers/scheduler.py", line 4106, in run_scheduler_process
File "python/sglang/srt/managers/scheduler.py", line 1850, in process_input_requests
File "python/sglang/srt/managers/scheduler_update_weights_mixin.py", line 56, in update_weights_from_disk
    success, message = self.tp_worker.update_weights_from_disk(recv_req)
…
File "python/sglang/srt/layers/quantization/modelopt_quant.py", line 1361, in process_weights_after_loading

The server then SIGQUITs and the test client sees Connection aborted / RemoteDisconnected while polling _post_json, which is the cascade — the underlying root cause is the AttributeError above.

Why this isn't v2:

  • The crash is in python/sglang/srt/layers/quantization/modelopt_quant.py:1361 (NVFP4 / modelopt quantization path on Blackwell). v2 doesn't touch quantization or weight-update flows.
  • This PR's diff in the scheduler is confined to the chunked-prefill state machine (per-Req fields, PP cross-microbatch finalize guard) and scheduler_update_weights_mixin.py is unmodified.
  • QKVParallelLinear is constructed without an input_scale attribute under this NVFP4 codepath; v2 doesn't change layer construction.

This looks like a pre-existing modelopt-quant bug in the NVFP4 weight-reload path on Blackwell — orthogonal to this PR.

@fzyzcjy
Collaborator Author

fzyzcjy commented May 14, 2026

stage-c-test-4-gpu-h100 (1) failure — fast-fail cascade, not a real run

Job: stage-c-test-4-gpu-h100 (1), attempt 5 — log

This job did not actually run any tests; it was skipped by the fast-fail gate:

##[error]Fast-fail: skipping — root cause job(s): stage-c-test-4-gpu-b200 (3)

The root cause is the pre-existing NVFP4 weight-update bug in b200 (3) — see the b200 (3) analysis above. No h100-specific signal here.

@fzyzcjy
Collaborator Author

fzyzcjy commented May 14, 2026

CI status snapshot (pre-rebase)

Current attempt 5 results: 107 success / 9 failure / 16 skipped / 2 still queued.

Per-failure classification (CUDA-lane only; AMD/NPU skipped per project policy):

| Job | Conclusion | Why | Comment |
| --- | --- | --- | --- |
| stage-c-test-4-gpu-b200 (3) | failure | Pre-existing NVFP4 weight-update bug (QKVParallelLinear.input_scale AttributeError in modelopt_quant.py:1361). Fixed on main by #25190 (commit 6c0633b0b1, 2026-05-13), which is not in this branch's merge-base. | analysis |
| stage-c-test-4-gpu-h100 (1) | failure | Fast-fail cascade from b200 (3). Did not run any test. | analysis |
| stage-c-test-dsv4-8-gpu-h200 | success (this attempt) | Earlier attempts 3 & 4 hit chronic infra flakes (NCCL port leftover / NVSHMEM UID bootstrap timeout), both before any v2 code path is exercised. | analysis |

Plan: rebase feat/stateless_scheduler_b onto the latest upstream/main to pick up #25190 and other recent fixes, then trigger a fresh CI run. The rebase should resolve the only real failure (b200 (3)); the cascade and infra flakes should resolve as a side effect.

fzyzcjy added 2 commits May 14, 2026 10:50
When chunked-resume reqs are held in both waiting_queue and batch.reqs
(stateless-scheduler refactor), abort_request would otherwise process
them twice (queue pop + to_finish), causing duplicate send_output and
double release_kv_cache. Build batch_rids upfront and skip waiting_queue
removal for reqs already in batch — let to_finish path handle them.

Pre-flight for stateless-scheduler v2.
For chunked-resume reqs (after the upcoming stateless-scheduler switch)
that live in waiting_queue with non-empty prefix_indices, summing
req.seqlen overcounts the committed prefix. Switch to seqlen - prefix
for waiting reqs; keep the chunked_req block until that field is removed.

Today's behavior is unchanged for fresh waiting reqs whose prefix_indices
is empty.

Pre-flight for stateless-scheduler v2.
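
The accounting change in the second commit can be illustrated with a small sketch. `WaitingReq` and `pending_tokens` are hypothetical names; the only thing taken from the commit message is the rule "count `seqlen` minus the committed prefix for waiting reqs".

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class WaitingReq:
    """Hypothetical waiting request: total length plus already-committed prefix."""

    seqlen: int
    prefix_indices: List[int] = field(default_factory=list)


def pending_tokens(waiting: List[WaitingReq]) -> int:
    """Tokens still to be prefilled, excluding each req's committed prefix."""
    return sum(req.seqlen - len(req.prefix_indices) for req in waiting)


fresh = [WaitingReq(10), WaitingReq(5)]          # empty prefixes: same as summing seqlen
resumed = [WaitingReq(100, list(range(64)))]     # 64 tokens already committed
print(pending_tokens(fresh), pending_tokens(resumed))
```

For fresh requests the result equals the plain sum of `seqlen`, which is why the commit describes today's behavior as unchanged.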
@fzyzcjy
Collaborator Author

fzyzcjy commented May 19, 2026

/rerun-test test/registered/lora/test_lora_qwen3_8b_logprob_diff.py

@fzyzcjy
Collaborator Author

fzyzcjy commented May 19, 2026

CUDA extra-a-test-1-gpu-large (1) failure — FlashAttention SM90 illegal address during CUDA graph capture

Job log

Failing test: test/registered/lora/test_lora_qwen3_8b_logprob_diff.py

Root cause:

[coredump] Detected an exception of type CUDBG_EXCEPTION_WARP_ILLEGAL_ADDRESS (14)
  - Kernel: cutlass_kernel_flash_attncuteflash_fwd_sm90FlashAttentionForwardSm90...
  - Site: lora/layers.py:724 -> quantization/unquant.py:161
Fatal Python error: Aborted (during CUDA graph capture, bs=256, avail_mem=12.04 GB)

Crash is inside the CUTLASS-CUTE FlashAttention SM90 kernel during cuda-graph capture for LoRA + Qwen3-8B. Our diff is scheduler-side chunked-resume bookkeeping with no LoRA, no attention-kernel, and no cuda-graph path changes. Test file is pre-existing on main (last touched by #24725 / #25197), not introduced by our merge.

extra-a-test-1-gpu-large (2) is a fast-fail cascade from the same workflow (a post-cleanup shadow job); no actual test ran.

Posting /rerun-test to confirm flake vs reproducible.

@github-actions
Contributor

github-actions Bot commented May 19, 2026

🚀 1-gpu-h100 (1 test): ❌ View workflow run

cd test/ && python3 registered/lora/test_lora_qwen3_8b_logprob_diff.py

@fzyzcjy
Collaborator Author

fzyzcjy commented May 19, 2026

LoRA Qwen3-8B extra-a-test-1-gpu-large CUDA fail — pre-existing flake on main

Update: my /rerun-test reran the file and it failed again with the same CUDBG_EXCEPTION_WARP_ILLEGAL_ADDRESS in FlashAttention SM90 CUTE during CUDA graph capture.

Cross-branch evidence: this is a pre-existing high-rate flake on main, not introduced by this PR. Of the last 19 rerun-test runs of test_lora_qwen3_8b_logprob_diff.py on the repo (last ~14h):

  • 9 failures
  • 10 successes

That's a ~47% flake rate at the SM90 FlashAttention kernel layer, well before this PR's merge. My PR's diff is scheduler-side chunked-resume bookkeeping; no LoRA/attention/cuda-graph code paths touched.
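
The quoted rate is simple arithmetic over the 19 runs. As a rough illustration, and assuming (the thread does not claim this) that reruns fail independently at that rate, the chance of two reruns in a row both failing can be estimated the same way.

```python
failures, successes = 9, 10

# Observed flake rate over the 19 rerun-test runs: roughly 0.47.
flake_rate = failures / (failures + successes)

# Under the independence assumption, probability that two consecutive
# reruns both fail: roughly 0.22.
both_fail = flake_rate ** 2

print(f"flake rate: {flake_rate:.1%}, two failures in a row: {both_fail:.1%}")
```

Two consecutive failures are therefore not strong evidence against a flake on their own, though the sample of 19 runs is small.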

Posting one more /rerun-test in case it lands on the success side this cycle. If it still fails, this is environmental and not blocking for the merge.

@fzyzcjy
Collaborator Author

fzyzcjy commented May 19, 2026

/rerun-test test/registered/lora/test_lora_qwen3_8b_logprob_diff.py

@github-actions
Contributor

github-actions Bot commented May 19, 2026

🚀 1-gpu-h100 (1 test): ❌ View workflow run

cd test/ && python3 registered/lora/test_lora_qwen3_8b_logprob_diff.py

@fzyzcjy
Collaborator Author

fzyzcjy commented May 19, 2026

AMD 2-GPU stage-b-test-2-gpu-large-amd failure — Mixtral/aiter MoE CUDA graph hang

Job log

Watchdog-triggered scheduler hang during CUDA graph capture for mistralai/Mixtral-8x7B-Instruct-v0.1 (tp_size=2, attention_backend=aiter, MoE via aiter/fused_moe.py:147):

[TP1] Pyspy failed (py-spy dump --native --pid 1464). Error: Failed to get stack traces
(repeats every 5 min, watchdog_timeout=300)

py-spy main thread stuck inside:
  fused_moe (aiter/fused_moe.py:147) -> ck_moe_stage1_fwd (aiter/ops/moe_op.py:555)
  -> run_moe_core -> mixtral.py:115 forward
  ... within cuda_graph_runner.capture

Hang in the AMD aiter MoE kernel during CUDA graph capture. This affects the AMD MI300 lane only, not the CUDA lane. Our diff is scheduler-side chunked-resume bookkeeping with no aiter, no MoE, no AMD-specific paths. The cascaded fast-fails (wait-for-stage-b-amd, pr-test-amd-finish) follow from this failure. Not blocking for merge.

@fzyzcjy
Collaborator Author

fzyzcjy commented May 20, 2026

AMD MI300 lane: two near-threshold perf assertion flakes — not ours, not blocking

Run on 5cc1a41d0d.

| Job | Test | Failure | Threshold |
| --- | --- | --- | --- |
| stage-b-test-1-gpu-small-amd (1) | test/registered/moe/test_torch_compile_moe.py | 235.7 < 240 (2% short) | "must be >= 240" |
| stage-b-test-1-gpu-large-amd (1) | test/registered/perf/test_bench_serving_1gpu_part2.py | 81.0 > 80 then 90.8 > 80 | "must be < 80" |

Both tests had already been retried once internally and failed both times; this looks like typical AMD MI300 hardware-noise flakiness around a perf threshold. Our diff is scheduler-side chunked-resume bookkeeping with no AMD / MoE / serving-perf code paths touched. The wait-for-stage-b-amd failure cascades from these two jobs. Not blocking the CUDA gate.

@fzyzcjy
Collaborator Author

fzyzcjy commented May 20, 2026

Non-CUDA lane failures on f531ac917b — all hardware/perf, none ours

Run (AMD) + Run (NPU).

Classification (CUDA lane is still all-green; see rollup):

| Lane | Job | Cause |
| --- | --- | --- |
| AMD MI300 | stage-b-test-1-gpu-small-amd (3) | HW Exception by GPU node-2 ... reason: GPU Hang (hardware) |
| AMD MI300 | stage-b-test-1-gpu-small-amd (6), (9) | likely same GPU-hang signature |
| AMD MI35x | stage-b-test-1-gpu-small-amd-mi35x, stage-b-test-large-8-gpu-...-disaggregation-amd, stage-c-test-large-8-gpu-amd-mi35x (0), (1) | MI35x lane failures |
| AMD MI300 | stage-c-test-4-gpu-amd (0), stage-c-test-large-8-gpu-amd (1) | non-blocking AMD stage-c |
| NPU | stage-b-test-16-npu-a3 | test_npu_deepep.py failed in NPU-specific code path |
| NPU | pr-test-npu-finish | meta cascade |

None plausibly caused by this PR's diff (scheduler-side chunked-resume bookkeeping with no AMD, no NPU, no deep-EP, and no kernel-level paths). Per sglang-babysit-ci skill: non-CUDA lanes are only fixed when clearly ours AND easy. Not blocking the CUDA gate.

@fzyzcjy
Collaborator Author

fzyzcjy commented May 20, 2026

/rerun-test test/registered/8-gpu-models/test_deepseek_v32_indexcache.py

@fzyzcjy
Collaborator Author

fzyzcjy commented May 20, 2026

extra-b-test-8-gpu-h200 (1) CUDA OOM — DeepSeek-V3.2 server restart leak between two sub-tests

Job log.

Failing test: test/registered/8-gpu-models/test_deepseek_v32_indexcache.py (DeepSeek-V3.2, tp_size=8, NSA attention).

Pattern:

Test 1 (mem_fraction_static=0.897, index_topk_freq=4) ran -> server torn down
Test 2 (mem_fraction_static=0.881, index_topk_pattern=...) server init:
  RuntimeError: CUDA out of memory. Tried to allocate 5.74 GiB.
  GPU has 139.80 GiB capacity, 5.33 GiB free.
  Process 3553102 has 134.46 GiB memory in use.
  (PyTorch allocator: 124.84 GiB, plus 896 MiB in CUDA Graphs private pool,
   plus 6.32 GiB reserved-but-unallocated)

This is the well-known pattern of CUDA allocations not being freed between sub-tests: the 134 GiB still held by sub-test 1's process means sub-test 2's init can't claim its mem_fraction_static budget.
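
The numbers in the OOM message can be checked directly. The sketch below only redoes the arithmetic from the log; `allocation_fits` is a hypothetical helper, and it makes no claim about how `mem_fraction_static` is applied internally.

```python
def allocation_fits(capacity_gib: float, in_use_gib: float, requested_gib: float) -> bool:
    """True if the request fits in what other processes have left free."""
    return requested_gib <= capacity_gib - in_use_gib


# Figures copied from the OOM message above.
free_gib = 139.80 - 134.46  # about 5.34 GiB, consistent with the reported 5.33 GiB free
fits = allocation_fits(139.80, 134.46, 5.74)  # False: 5.74 GiB does not fit

print(f"free: {free_gib:.2f} GiB, 5.74 GiB request fits: {fits}")
```

With the stale process gone, the same request would fit easily, which is consistent with the rerun being used to confirm a flake.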

Our diff is scheduler-side chunked-resume bookkeeping; no allocator, kv pool tear-down, multiprocessing, or DSv3.2 model code paths touched. Posting /rerun-test to confirm flake.

@github-actions
Contributor

github-actions Bot commented May 20, 2026

🚀 8-gpu-h200 (1 test): ✅ View workflow run

cd test/ && python3 registered/8-gpu-models/test_deepseek_v32_indexcache.py

@fzyzcjy
Collaborator Author

fzyzcjy commented May 20, 2026

/rerun-test test/registered/dsv4/test_deepseek_v4_flash_fp4_megamoe_b200.py

@fzyzcjy
Collaborator Author

fzyzcjy commented May 20, 2026

base-c-test-dsv4-4-gpu-b200 (0) CUDA fail — CUTLASS DSL / flashinfer binding TypeError, not ours

Job log.

Failing test: test/registered/dsv4/test_deepseek_v4_flash_fp4_megamoe_b200.py (DeepSeek-V4, B200, FP4 mega-MoE).

Root cause: Python TypeError raised during cute.compile of flashinfer's rmsnorm kernel:

File ".../flashinfer/norm/kernels/rmsnorm.py", line 246, in kernel
    tidx, _, _ = cute.arch.thread_idx()
File ".../cutlass/_mlir/dialects/_nvvm_ops_gen.py", line 9830, in __init__
    super().__init__(self.OPERATION_NAME, ...)
TypeError: __init__(): incompatible function arguments.
  Supported: __init__(self, operation: object) -> None
  Invoked with: (ThreadIdXOp, str, tuple, NoneType, NoneType, kwargs={attributes:dict, results:list, operands:list, ...})

ThreadIdXOp constructor signature mismatch between the installed cutlass-dsl and what flashinfer.norm.kernels.rmsnorm is calling. Dependency binding bug, surfaces during CUDA-graph capture for DSv4 q_norm.

Our diff is scheduler-side chunked-resume bookkeeping; no flashinfer, no CUTLASS DSL, no rmsnorm, no DSv4 model code paths touched. Last successful rerun-test of this same file ~14h ago was on a different SHA, so it's plausibly a recent main env/dep drift.

Posting /rerun-test to confirm flake vs reproducible.

@github-actions
Contributor

github-actions Bot commented May 20, 2026

🚀 4-gpu-b200 (1 test): ✅ View workflow run

cd test/ && python3 registered/dsv4/test_deepseek_v4_flash_fp4_megamoe_b200.py

fzyzcjy added 12 commits May 25, 2026 15:29
schedule_batch.py: drop self.maybe_wait_verify_done() call in merge_batch —
  upstream removed verify_done.wait via FutureMap routing (sgl-project#25879); keep our
  branch's assert against chunked/dllm reqs in other.reqs.
test/registered/unit/managers/test_scheduler_chunked_req_gate.py: keep
  HEAD's deletion (v1 gate removed in v2); upstream's array.array
  migration is moot since the file goes away.
Adds a by-rid dict tracking the sync-mode reqs whose lifecycle the
scheduler currently owns (admitted, not finished, not retracted). Runs as a
parallel tracker alongside existing waiting_queue / running_batch
without changing scheduler behavior. DEBUG_INVARIANTS=1 enables
_assert_invariants checks at get_next_batch_to_run boundaries.

Part of waiting_queue refactor plan, commit 1/7. See agent-drafts/
2026-05-25-waiting-queue-refactor-plan.md.
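
A parallel ownership tracker of the kind this commit describes might look like the sketch below. Everything here is illustrative: `ActiveReqTracker` and its methods are hypothetical, and the specific invariant checked (every running rid must be tracked) is an assumption, not taken from the PR's `_assert_invariants`.

```python
import os
from typing import Dict, Iterable, Optional


class ActiveReqTracker:
    """Hypothetical by-rid tracker for reqs the scheduler currently owns."""

    def __init__(self, debug: Optional[bool] = None) -> None:
        self.active_reqs: Dict[str, object] = {}
        # Checks are opt-in, mirroring the DEBUG_INVARIANTS=1 switch in the commit text.
        if debug is None:
            debug = os.environ.get("DEBUG_INVARIANTS") == "1"
        self.debug = debug

    def activate(self, rid: str, req: object) -> None:
        self.active_reqs[rid] = req

    def deactivate(self, rid: str) -> None:
        self.active_reqs.pop(rid, None)

    def assert_invariants(self, running_rids: Iterable[str]) -> None:
        """Assumed invariant: every running req must also be tracked as active."""
        if not self.debug:
            return
        missing = [rid for rid in running_rids if rid not in self.active_reqs]
        assert not missing, f"running reqs missing from active_reqs: {missing}"
```

Because the tracker only observes activation and release, it can run alongside existing queues without changing scheduling decisions.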
Eliminates H3 hack (Stage A scanning the full waiting_queue to find
chunked-resume reqs). Now scans the chunked_reqs() view derived from
active_reqs. Behavior identical to C1 because C1's retention keeps
waiting_queue and active_reqs in sync for chunked-resume reqs.

Part of waiting_queue refactor plan, commit 2/7.
…(C3)

Adds an inline chunked admission block at the top of
_get_new_batch_prefill_raw that consumes chunked_reqs() directly.
Strips has_pending_chunk branches from the main waiting_queue loop
(H6 LoRA drainer bypass, H7 init_next_round_input split). The
waiting_queue retention for chunked-resume is still in place; it is
removed in C4. Single-flight assertion enforced at the inline
admission entry.

Part of waiting_queue refactor plan, commit 3/7.
Chunked-resume reqs no longer anchor in waiting_queue (H2 hack
elimination). The retention `or x.has_pending_chunk` is removed; the
transitional guard added in C3 to prevent double-admit is also
removed. After this commit, chunked-resume reqs live exclusively in
active_reqs and are re-admitted via the inline block at the top of
_get_new_batch_prefill_raw.

Part of waiting_queue refactor plan, commit 4/7.
…d bypasses (C5)

Now that chunked-resume reqs live in active_reqs (post-C4), the
defensive bypasses that scanned waiting_queue for has_pending_chunk
become dead code. Eliminates H4 (early-exit has_chunked_resume scan),
H5 (dynamic-chunking lookup), AB7 (_abort_on_waiting_timeout
has_pending_chunk skip), plus a stale comment referencing the deleted
retention. Single chunked_in_active computation reused throughout
_get_new_batch_prefill_raw.

Part of waiting_queue refactor plan, commit 5/7.
Eliminates H1 (dual-existence comment) and H8 (defensive
has_pending_chunk / pending_middle_outputs reset on waiting-segment
orphan release). Post-C4 chunked-resume reqs no longer live in
waiting_queue, so the waiting-segment orphan branch is narrowed to
mamba-pool reqs only.

Critical: the active-segment loop now iterates active_reqs instead of
batch_reqs, distinguishing in-batch reqs (FINISH_ABORT via batch
result path) from stashed chunked-resume reqs (immediate release +
_deactivate, audit finding 2). Without this, aborting a chunked-
resume mid-prefill outside of any current batch would leak
row + KV + lock_ref.

Part of waiting_queue refactor plan, commit 6/7.
- Tightens _assert_invariants: waiting_queue and active_reqs are now
  strictly disjoint (sync mode); C1's relaxed transitional clause
  removed.
- Removes C1's _activate idempotency filter at the admission call
  site; the main admission loop no longer produces re-admits after
  C3/C4.
- Adds comprehensive invariant documentation as field-level comments
  on Scheduler.active_reqs and method docstring on chunked_reqs().
- Migrates pause_generation(retract) chunked release path to iterate
  chunked_reqs() instead of scanning waiting_queue (dead post-C4),
  and flags a pre-existing latent bug (req not re-enqueued after
  reset_for_retract).

Concludes the waiting_queue refactor chain (commit 7/7). See
agent-drafts/2026-05-25-waiting-queue-refactor-plan.md and audit.
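
The tightened invariant (waiting_queue and active_reqs strictly disjoint) reduces to a set intersection. The helper names below are hypothetical, and the sketch operates on request ids only.

```python
from typing import Iterable, Set


def overlapping_rids(waiting_rids: Iterable[str], active_rids: Iterable[str]) -> Set[str]:
    """Request ids that appear in both collections."""
    return set(waiting_rids) & set(active_rids)


def assert_disjoint(waiting_rids: Iterable[str], active_rids: Iterable[str]) -> None:
    """Raise AssertionError if any req is both waiting and active."""
    overlap = overlapping_rids(waiting_rids, active_rids)
    assert not overlap, f"reqs in both waiting_queue and active_reqs: {sorted(overlap)}"
```

Reporting the overlapping ids in the assertion message makes a violation easier to trace back to the admission or release site that caused it.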
Review of C1-C7 revealed two P0 bugs and one P1:

1. _activate fired unconditionally in _get_new_batch_prefill_raw,
   enrolling disagg PREFILL and DLLM reqs into active_reqs. Neither
   path has a corresponding _deactivate (disagg PREFILL uses
   process_batch_result_disagg_prefill; DLLM uses dllm/mixin paths),
   leaking active_reqs entries indefinitely and crashing abort_all
   via the new stashed-chunked assert (C6).

2. flush_cache cleared tree cache / pool but not active_reqs,
   leaving stale dict entries pointing at freed req_pool_idx.

Fix: gate _activate at the helper itself (single point of control)
to enforce the "sync-mode non-DLLM only" invariant that the plan +
audit always assumed but code didn't enforce. flush_cache.clear()
ensures the dict is reset alongside other ownership pools.

Also: rewrite two stale comments referencing pre-C4 waiting_queue
retention.

Part of waiting_queue refactor chain, commit 8/7 (post-review fix).
…(C9)

C3 inlined the body of upstream's `PrefillAdder.add_chunked_req` into
`_get_new_batch_prefill_raw` to avoid resurrecting the special method.
But `add_one_req` already supports chunked-resume via its `is_resume`
path (`has_pending_chunk and not is_dllm`), which gates:
- budget_prefix=0 (no prefix double-count)
- skip _req_inc_lock_ref (already held from prior admission)
- update has_pending_chunk = truncated

So the inline manual budget code was a copy of logic that
`add_one_req` already encapsulates. C9 replaces the ~30-line inline
block with a single `adder.add_one_req(chunked_req, ...)` call;
chunked admission still runs BEFORE the main waiting_queue loop so it
skips LoRA drainer / hicache prefetch checks that don't apply to
in-flight chunked.

Removes scheduler.py access to PrefillAdder protected methods
(`_get_dllm_remain_tokens`, `_update_prefill_budget`) — these stay
encapsulated. Behavior change: `prefill_delayer_single_pass` /
`prefill_max_requests` / `dsa_prefill_cp_in_seq_split` early-exit
gates now apply to chunked too. Safe in practice: chunked runs first
so can_run_list is empty for `_max_requests` / `cp_in_seq_split`
checks; prefill_delayer blocking chunked just delays one iter.

Part of waiting_queue refactor chain, commit 9/7.
…LL leak (C10)

Two motivations:

1. BUG: C8's `_activate` gate excluded ALL disagg modes, but disagg
   PREFILL shares _get_new_batch_prefill_raw with sync — chunked-resume
   reqs were admitted, then orphaned (out of waiting_queue per C4, not
   in active_reqs per C8), leaking row + KV + lock_ref. Fix: gate to
   DECODE only (which has its own prealloc/transfer queue ownership),
   then wire _deactivate at disagg/prefill.py's three release_kv_cache
   sites and migrate its Stage A loop to chunked_reqs().

2. CLEANUP: post-C4 chunked-resume never lives in waiting_queue, but
   several supporting files still split waiting_queue by
   has_pending_chunk (schedule_policy.py 3 sites,
   pool_stats_observer.py, invariant_checker.py, several stale
   comments). Revert/migrate to read active_reqs.

DECODE mode is still excluded from active_reqs (Q1=(c)); only PREFILL
is now correctly tracked.

Part of waiting_queue refactor chain, commit 10/7.
C10 narrowed _activate's gate to DECODE-only, so disagg PREFILL
chunked-resume reqs now enter active_reqs and can be reached by the
abort_request active-segment stashed-chunked branch (C6). But that branch
only does release_kv_cache + _deactivate — missing two pieces of
disagg PREFILL cleanup that pause_generation(retract) does correctly:

1. disagg_kv_sender.abort() — without this, the peer decode node
   waits forever for the remaining chunks (hang).
2. release_req_to_metadata_buffer() — metadata buffer slot leak.

Mirrors pause_generation(retract) PREFILL handling and abort_request
waiting-segment PREFILL handling.

Also: clean stale "assert above" comment in disagg/decode.py
(the assert was deleted in C10).

Part of waiting_queue refactor chain, commit 11/7.